immunarch is an R package designed to analyse TCR and BCR (immunoglobulin) repertoires, which constitute a large amount of data. The mission of immunarch is to make immune sequencing data analysis as effortless as possible—and help you focus on research instead of coding.
Fast and easy manipulation of immune repertoire data:
The package automatically detects the format of your files—no more guessing what format is that file, just pass them to the package;
Supports all popular TCR and BCR analysis and post-analysis formats: ImmunoSEQ, IMGT, MiTCR, MiXCR, MiGEC, MigMap, VDJtools, tcR. More coming in the future;
Works on any data source you are comfortable with: R data frames, data tables from data.table, databases like MonetDB, Apache Spark data frames via sparklyr.
Immune repertoire analysis made simple:
Most methods are incorporated in a couple of main functions with clear naming—no more remembering tens and tens of functions with obscure names;
Repertoire overlap analysis (common indices including overlap coefficient, Jaccard index and Morisita’s overlap index)
Gene usage estimation (correlation, Jensen-Shannon Divergence, clustering)
Diversity evaluation (ecological diversity index, Gini index, inverse Simpson index, rarefaction analysis)
Coming in the next releases: CDR3 amino acid physical and chemical properties assessment, Kmer distribution measures and statistics, mutation networks, tracking clonotypes across time points
Publication-ready plots with a built-in tool for tweaking visualisations:
Rich visualisation procedures with ggplot2;
Built-in tool fixVis makes your plots publication-ready: easily change font sizes, text angles, titles, legends and many more with clear-cut GUI;
You can find the list of releases of immunarch here: https://github.com/immunomind/immunarch/releases
In order to install immunarch, you need to download it first. If you want to download the latest version, you need to download the package file, available here https://github.com/immunomind/immunarch/releases/download/latest/immunarch.tar.gz
Note that You should not un-archive it!
After downloading the file, you need to install a number of packages with R commands listed below, and run the newly installed devtools package to install immunarch locally. Upon completion the dependencies will have been already downloaded and installed.
install.packages("devtools", dependencies = T)
devtools::install_local("path/to/your/folder/with/immunarch.tar.gz", dependencies=T)
That’s it, you can start using immunarch now!
If you run in any trouble, try the following steps:
Check your R version. Run version command in the console to get your R versions. If the R version is below 3.4.0 (for example, R version 3.1.0), try updating your R version to the latest one.
Check if your packages are outdated and update them. In RStudio you can run the “Update” button on top of the package list. In R console you can run the old.packages() command to view a list of outdated packages.
If you are on Mac and have issues like old packages can’t be updated, or error messages such as ld: warning: directory not found for option or ld: library not found for -lgfortran, this link will help you to fix the issue.
If you are working under Linux and have issues with library or have Fortran errors, see this link
If troubles still persist, message us on support@immunomind.io or create an issue in https://github.com/immunomind/immunarch/issues with the code that represents the issue and the output you get in the console.
Importing data into R is fairly simple. The gist of the typical TCR or BCR explorational data analysis workflow can be reduced to the next few lines of code:
# Load the data to the package
immdata = repLoad("path/to/your/folder/with/repertoires")
# If you folder contains metadata.txt file, immdata will have two elements:
# - immdata$data with a list of parsed repertoires
# - immdata$meta with the metadata file
# Compute and visualise overlap statistics
ov = repOverlap(immdata$data)
vis(ov)
# Cluster samples using K-means algorithm applied to the number of overlapped clonotypes
# and visualise the results
ov.kmeans = repOverlapAnalysis(ov, .method = "kmeans")
vis(ov.kmeans)
# Compute and visualise gene usage with samples, grouped by their disease status
gu = geneUsage(immdata$data)
vis(gu, .by="Status", .meta=immdata$meta)
# Compute Jensen-Shannon divergence among gene distributions of samples,
# cluster samples using the hierarchical clustering and visualise the results
gu.clust = geneUsageAnalysis(gu, .method = "js+hclust")
vis(gu.clust)
# Compare diversity of repertoires and visualise samples, grouped by two parameters
div = repDiversity(immdata$data, .method = "chao1")
vis(div, .by=c("Status", "Treatment"), .meta=immdata$meta)
# Tweak and fix the visalisation of diversity estimates to make the plot publication-ready
div.plot = vis(div, .by=c("Status", "Treatment"), .meta=immdata$meta)
fixVis(div.plot)
If you want to test the package without parsing any data, you can load a small test dataset provided along with the package. Load the data with the following command:
data(immdata)
immunarch comes with its own data format, including tab-delimited columns that can be specified as follows:
“Clones” - count or number of barcodes (events, UMIs) or reads;
“Proportion” - proportion of barcodes (events, UMIs) or reads;
“CDR3.nt” - CDR3 nucleotide sequence;
“CDR3.aa” - CDR3 amino acid sequence;
“V.name” - names of aligned Variable gene segments;
“D.name” - names of aligned Diversity gene segments or NA;
“J.name” - names of aligned Joining gene segments;
“V.end” - last positions of aligned V gene segments (1-based);
“D.start” - positions of D’5 end of aligned D gene segments (1-based);
“D.end” - positions of D’3 end of aligned D gene segments (1-based);
“J.start” - first positions of aligned J gene segments (1-based);
“VJ.ins” - number of inserted nucleotides (N-nucleotides) at V-J junction (-1 for receptors with VDJ recombination);
“VD.ins” - number of inserted nucleotides (N-nucleotides) at V-D junction (-1 for receptors with VJ recombination);
“DJ.ins” - number of inserted nucleotides (N-nucleotides) at D-J junction (-1 for receptors with VJ recombination);
“Sequence” - full nucleotide sequence.
The package provides several IO functions:
repLoad - to load the repertoires, having compatible format.
repSave - to save changes and to write the repertoire data to a file in a specific format (immunarch, VDJtools or MonetDB).
repLoad has the .format argument that sets the format for input repertoire files. Do not provide it if you want immunarch to detect the formats and parse files automatically! In case you want to force the package to parse the data in a specific format, you can choose one of the several options for the argument:
"immunarch" - current software tool, in case you forgot it :)
"immunoseq" - http://www.adaptivebiotech.com/immunoseq
"mitcr" - https://github.com/milaboratory/mitcr
"mixcr" - https://github.com/milaboratory/mixcr
"migec" - http://migec.readthedocs.io/en/latest/
"migmap" - https://github.com/mikessh/migmap
"tcr" - https://imminfo.github.io/tcr/
"vdjtools" - http://vdjtools-doc.readthedocs.io/en/latest/
"imgt" - http://www.imgt.org/HighV-QUEST/
These parsers will be available soon.
"imseq" - http://www.imtools.org/
"rtcr" - https://github.com/uubram/RTCR
"vidjil" - http://www.vidjil.org/
Please contact us if there are more file formats you want to be supported.
For parsing IgBLAST results process the data with MigMap first.
You can load the data either from a single file or from a folder with repertoire files. A single file can be loaded as follows:
# To load the data from a single file without forcing the data format:
immdata <- repLoad("path/to/your/folder/immunoseq_1.txt")
# To load the data from a single ImmunoSEQ file go with:
immdata <- repLoad("path/to/your/folder/immunoseq_1.txt", .format = "immunoseq")
In the second case you may want to provide a metadata file and locate it in the folder. It is necessary to name it exactly “metadata.txt”.
# For instance you have a following structure in your folder:
# >_ ls
# immunoseq1.txt
# immunoseq2.txt
# immunoseq3.txt
# metadata.txt
With the metadata repLoad will create a list in the environment with 2 elements, namely data and meta. All the data will be accessible simply from immdata$data.
Otherwise repLoad will create a list with the number of elements matching the number of your files. They will be accessible directly from immdata.
# To load the whole folder with every file in it type:
immdata <- repLoad("path/to/your/folder/")
# In order to do that your folder must contain metadata file named
# exactly "metadata.txt".
# In R, when you load your data:
# > immdata <- repLoad("path/to/your/folder/")
# > names(immdata)
# [1] "data" "meta"
# Suppose you do not have "metadata.txt":
# > immdata <- repLoad("path/to/your/folder/")
# > names(immdata)
# [1] "immunoseq_1" "immunoseq_2" "immunoseq_3"
The metadata has to be tab delimited file with first column named “Sample” and any number of additional columns with arbitrary names. The first column should contain base names of files without extensions in your folder.
| Sample | Sex | Age | Status |
|---|---|---|---|
| immunoseq_1 | M | 1 | C |
| immunoseq_2 | M | 2 | C |
| immunoseq_3 | F | 3 | A |
In order to import data from the external databases you have to create a connection to this database and then load the data. Make sure that the table format in your database matches the immunarch’s format.
To illustrate the use of external database, here is an example demonstrating data loading to the local MonetDB database:
# Your list of repertoires in immunarch's format
DATA
# Metadata data frame
META
# Create a temporary directory
dbdir = tempdir()
# Create a DBI connection to MonetDB in the temporary directory.
con = DBI::dbConnect(MonetDBLite::MonetDBLite(), embedded = dbdir)
# Write each repertoire to MonetDB. Each table has corresponding name from the DATA
for (i in 1:length(DATA)) {
DBI::dbWriteTable(con, names(DATA)[i], DATA[[i]], overwrite=TRUE)
}
# Create a source in the temporary directory with MonetDB
ms = MonetDBLite::src_monetdblite(dbdir = dbdir)
res_db = list()
# Load the data from MonetDB to dplyr tables
for (i in 1:length(DATA)) {
res_db[[names(DATA)[i]]] = dplyr::tbl(ms, names(DATA)[i])
}
# Your data is ready to use
list(data = res_db, meta = META)
You might want to make a list with data, which is a list of tables of repertoires, and the meta, containing metadata:
# Load the data to the immdata variables. Metadata file "metadata.txt" will be found automatically.
immdata = repLoad("your_folder", "optionally_your_format")
# Repertoires
immdata$data
# Metadata
immdata$meta
immunarch is compatible with following sources:
R data frames (for common applications)
R data tables (for speed, although requires lots of RAM)
MonetDB-like databases that support both DBI and dplyr (best choice, although you have to be a bit familiar with dplyr)
Apache Spark (if you are expierienced and comfortable with it)
You can find the introduction to dplyr here: https://cran.rstudio.com/web/packages/dplyr/vignettes/introduction.html
The function returns the most abundant clonotypes for the given repertoire:
top(immdata$data[[1]])
## # A tibble: 10 x 15
## Clones Proportion CDR3.nt CDR3.aa V.name D.name J.name V.end D.start
## <dbl> <dbl> <chr> <chr> <chr> <chr> <chr> <int> <int>
## 1 1647 0.0206 TGCGCC… CASSQE… TRBV4… TRBD1 TRBJ2… 16 18
## 2 1409 0.0176 TGCGCC… CASSYR… TRBV4… TRBD1 TRBJ2… 11 13
## 3 747 0.00934 TGTGCC… CATSTN… TRBV15 TRBD1 TRBJ2… 11 16
## 4 511 0.00639 TGCGCC… CASQGD… TRBV4… TRBD1 TRBJ1… 8 13
## 5 463 0.00579 TGTGCC… CATSIG… TRBV15 TRBD2 TRBJ2… 11 19
## 6 450 0.00562 TGTGCC… CASSPW… TRBV27 TRBD1 TRBJ1… 11 16
## 7 336 0.0042 TGTGCC… CASSEE… TRBV2 TRBD1 TRBJ1… 15 17
## 8 335 0.00419 TGCGCC… CASSQD… TRBV4… TRBD1 TRBJ2… 16 21
## 9 287 0.00359 TGTGCC… CASSWV… TRBV6… TRBD1 TRBJ2… 12 20
## 10 268 0.00335 TGCGCC… CASSQD… TRBV4… TRBD1 TRBJ2… 16 25
## # … with 6 more variables: D.end <int>, J.start <int>, VJ.ins <dbl>,
## # VD.ins <dbl>, DJ.ins <dbl>, Sequence <lgl>
Conveniently, functions are vectorised over the list of data frames; and coding(immdata$data) in the example below returns a list of data frames with coding sequences:
coding(immdata$data[[1]])
The next one operates in a similar fashion:
noncoding(immdata$data[[1]])
Now, the computation of the number of filtered sequences is straightforward:
nrow(inframes(immdata$data[[1]]))
And for the out-of-frame clonotypes:
nrow(outofframes(immdata$data[[1]]))
It is simple to subset data frame according to labels in the specified index. In the example the resulting data frame contains only records with ‘TRBV10-1’ V gene:
filter(immdata$data[[1]], V.name == 'TRBV10-1')
## # Source: lazy query [?? x 15]
## # Database: MonetDBEmbeddedConnection
## Clones Proportion CDR3.nt CDR3.aa V.name D.name J.name V.end D.start
## <dbl> <dbl> <chr> <chr> <chr> <chr> <chr> <int> <int>
## 1 7 0.0000875 TGCGCC… CASSES… TRBV1… TRBD2 TRBJ2… 16 20
## 2 5 0.0000625 TGCGCC… CASSDG… TRBV1… TRBD1 TRBJ2… 13 15
## 3 4 0.00005 TGCGCC… CASSEA… TRBV1… TRBD2 TRBJ2… 14 21
## 4 4 0.00005 TGCGCC… CASSEG… TRBV1… TRBD2 TRBJ2… 14 15
## 5 4 0.00005 TGCGCC… CASSAP… TRBV1… TRBD2 TRBJ2… 12 13
## 6 4 0.00005 TGCGCC… CATLRS… TRBV1… TRBD1 TRBJ2… 6 7
## 7 3 0.0000375 TGCGCC… CASSES… TRBV1… TRBD2 TRBJ2… 16 20
## 8 3 0.0000375 TGCGCC… CASSES… TRBV1… TRBD2 TRBJ2… 16 17
## 9 2 0.000025 TGCGCC… CASSGE… TRBV1… TRBD2 TRBJ2… 12 13
## 10 2 0.000025 TGCGCC… CASSES… TRBV1… TRBD2 TRBJ2… 13 15
## # … with more rows, and 6 more variables: D.end <int>, J.start <int>,
## # VJ.ins <dbl>, VD.ins <dbl>, DJ.ins <dbl>, Sequence <lgl>
For each task in this section immunarch includes separate functions that are generally self-explanatory and are written in camel-case. Shorter names for the functions are also availbalie, although the authors do not recommend their usage for the sake of clarity and readability of the code. The latter are mentionned in parantheses. Basic analysis functions are:
repExplore (rep.ex) - to compute basic statistics, such as number of clones or distributions of lengths and counts. To explore them you need to pass the statistics, e.g. count, to the .method.
repClonality (rep.clo) - to compute the clonality of repertoires.
repOverlap (rep.ov) - to compute the repertoire overlap.
repOverlapAnalysis (rep.ova) - to analyse the repertoire overlap, including different clustering procedures and PCA.
geneUsage (gen.us) - to compute the distributions of V or J genes.
geneUsageAnalysis (gen.usa) - to analyse the distributions of V or J genes, including clustering and PCA.
repDiversity (rep.div) - to estimate the diversity of repertoires.
trackClonotypes (tr.clo) - to analyse the dynamics of repertoires across time points.
spectratype - to compute spectratype of clonotypes.
Output of each function could be passed directly to the vis function - the general function for visualisation. Examples of usage are written below.
Plots generated by the vis function can be passed to fixVis - built-in software tool for making publication-ready plots. See the “Quick start” section for usage.
For the basic exploratory analysis such as comparing of number of reads / UMIs per repertoire or distribution use the function repExplore.
exp_len = repExplore(immdata$data, .method = "len", .col = "aa")
exp_cnt = repExplore(immdata$data, .method = "count")
exp_vol = repExplore(immdata$data, .method = "volume")
p1 = vis(exp_len)
p2 = vis(exp_cnt)
p3 = vis(exp_vol)
p1
grid.arrange(p2, p3, ncol = 2)
# You can group samples by their metainformation
p4 = vis(exp_len, .by="Status", .meta=immdata$meta)
p5 = vis(exp_cnt, .by="Sex", .meta=immdata$meta)
p6 = vis(exp_vol, .by=c("Status", "Sex"), .meta=immdata$meta)
p4
grid.arrange(p5, p6, ncol = 2)
One of the ways to estimate the diversity of samples is to evaluate clonality. repClonality measures the amount of the most or the least frequent clonotypes. There are several methods to assess clonality, let us take a view of them. The clonal.prop method computes the proportion of repertoire occupied by the pools of cell clones:
imm_pr = repClonality(immdata$data, .method = "clonal.prop")
imm_pr
## Clones Percentage Clonal.count.prop
## A2-i129 18 10.1 4.195413e-04
## A2-i131 31 10.1 6.902387e-04
## A2-i133 8 10.3 1.947799e-04
## A2-i132 140 10.0 3.323284e-03
## A4-i191 4 11.5 1.209995e-04
## A4-i192 8 10.2 2.274472e-04
## MS1 3 12.6 8.001067e-05
## MS2 74 10.0 1.580656e-03
## MS3 2 10.8 3.762015e-05
## MS4 279 10.0 5.344828e-03
## MS5 2 10.3 5.141917e-05
## MS6 205 10.0 3.750114e-03
## attr(,"class")
## [1] "matrix" "immunr_clonal_prop"
The top method considers the most abundant cell clonotypes:
imm_top = repClonality(immdata$data, .method = "top")
imm_top
## 10 100 1000 3000 10000 30000 1e+05
## A2-i129 0.0806625 0.1661375 0.2869000 0.3840000 0.5676000 0.8387000 1
## A2-i131 0.0666625 0.1506875 0.2912000 0.3928875 0.5636000 0.8136000 1
## A2-i133 0.1101500 0.1820250 0.3085375 0.4093875 0.5957375 0.8616000 1
## A2-i132 0.0238500 0.0854250 0.2423750 0.3728625 0.5723125 0.8484125 1
## A4-i191 0.1729375 0.3158875 0.4576250 0.5482750 0.7117750 0.9617750 1
## A4-i192 0.1142750 0.2141250 0.3629250 0.4803500 0.6744500 0.9353375 1
## MS1 0.1957000 0.3068250 0.4328750 0.5132625 0.6563125 0.9063125 1
## MS2 0.0455875 0.1091125 0.2059000 0.2919125 0.4797750 0.7898000 1
## MS3 0.1673750 0.2293625 0.2969875 0.3522750 0.4604625 0.7104625 1
## MS4 0.0191000 0.0626000 0.1737000 0.2684125 0.4537375 0.7225000 1
## MS5 0.2093375 0.2884500 0.3949000 0.4813500 0.6388000 0.8888000 1
## MS6 0.0296375 0.0750125 0.1861625 0.2783250 0.4416875 0.6916875 1
## attr(,"class")
## [1] "matrix" "immunr_top_prop"
While the tail method deals with the least prolific clonotypes:
imm_tail = repClonality(immdata$data, .method = "tail")
imm_tail
## 1 3 10 30 100
## A2-i129 0.3902000 0.6478375 0.7675125 0.8353500 0.8743125
## A2-i131 0.4488875 0.6335125 0.7579000 0.8353625 0.9059250
## A2-i133 0.3725375 0.6166750 0.7446000 0.8172000 0.8704750
## A2-i132 0.3754875 0.6225375 0.7950750 0.9094500 0.9705000
## A4-i191 0.3015000 0.4948500 0.5897750 0.6648000 0.7293250
## A4-i192 0.3037750 0.5276000 0.6829875 0.7731625 0.8326125
## MS1 0.3756375 0.5337000 0.6197000 0.6793500 0.7409375
## MS2 0.4001750 0.7526375 0.8507500 0.9006875 0.9349875
## MS3 0.6063500 0.7091250 0.7536625 0.7829500 0.8027875
## MS4 0.5087375 0.7636875 0.8809875 0.9487375 0.9844625
## MS5 0.3787500 0.5660625 0.6630625 0.7151625 0.7536000
## MS6 0.5722375 0.7613500 0.8715875 0.9367875 0.9690500
## attr(,"class")
## [1] "matrix" "immunr_tail_prop"
Finally, the homeo method assesses the clonal space homeostasis, i.e., the proportion of the repertoire occupied by the clones of a given size:
imm_hom = repClonality(immdata$data, .method = "homeo")
imm_hom
## Rare (0 < X <= 1e-05) Small (1e-05 < X <= 1e-04)
## A2-i129 0 0.7479750
## A2-i131 0 0.7390625
## A2-i133 0 0.7246750
## A2-i132 0 0.7644250
## A4-i191 0 0.5739125
## A4-i192 0 0.6598500
## MS1 0 0.6038500
## MS2 0 0.8407250
## MS3 0 0.7473000
## MS4 0 0.8653125
## MS5 0 0.6467750
## MS6 0 0.8531875
## Medium (1e-04 < X <= 0.001) Large (0.001 < X <= 0.01)
## A2-i129 0.1217875 0.0920375
## A2-i131 0.1504750 0.0982750
## A2-i133 0.1334375 0.0651750
## A2-i132 0.1959250 0.0396500
## A4-i191 0.1396000 0.1568250
## A4-i192 0.1624375 0.0844625
## MS1 0.1223875 0.0972250
## MS2 0.0863375 0.0601875
## MS3 0.0519500 0.0630625
## MS4 0.1155875 0.0191000
## MS5 0.1024625 0.0662000
## MS6 0.1102750 0.0365375
## Hyperexpanded (0.01 < X <= 1)
## A2-i129 0.0382000
## A2-i131 0.0121875
## A2-i133 0.0767125
## A2-i132 0.0000000
## A4-i191 0.1296625
## A4-i192 0.0932500
## MS1 0.1765375
## MS2 0.0127500
## MS3 0.1376875
## MS4 0.0000000
## MS5 0.1845625
## MS6 0.0000000
## attr(,"class")
## [1] "matrix" "immunr_homeo"
grid.arrange(vis(imm_top), vis(imm_top, .by="Status", .meta=immdata$meta), ncol = 2)
grid.arrange(vis(imm_tail), vis(imm_tail, .by="Status", .meta=immdata$meta), ncol = 2)
grid.arrange(vis(imm_hom), vis(imm_hom, .by="Status", .meta=immdata$meta), ncol = 2)
Repertoire overlap is the most common approach to measure repertoire similarity. It is achieved by computation of specific statistics on clonotypes shared between given repertoires, also called “public” clonotypes. immunarch provides several indices: - number of public clonotypes (.method = "public") - a classic measure of overlap similarity.
overlap coefficient (.method = "overlap") - a normalised measure of overlap similarity. It is defined as the size of the intersection divided by the smaller of the size of the two sets.
Jaccard index (.method = "jaccard") - it measures similarity between finite sample sets, and is defined as the size of the intersection divided by the size of the union of the sample sets.
Tversky index (.method = "tversky") - an asymmetric similarity measure on sets that compares a variant to a prototype. If using default arguments, it’s similar to Dice’s coefficient.
cosine similarity (.method = "cosine") - a measure of similarity between two non-zero vectors
Morisita’s overlap index (.method = "morisita") - a statistical measure of dispersion of individuals in a population. It is used to compare overlap among samples.
top-overlap - overlaps of the N most abundant clonotypes with sequentially growing N (.method = "top+METHOD")
The function that includes described methods is repOverlap. Again the output is easily visualised when passed to vis() function that does all the work:
imm_ov1 = repOverlap(immdata$data, .method = "public", .verbose = F)
imm_ov2 = repOverlap(immdata$data, .method = "morisita", .verbose = F)
grid.arrange(vis(imm_ov1), vis(imm_ov2, .text.size=1.5), ncol = 2)
vis(imm_ov1, "heatmap2")
To analyse the computed overlap measures function apply repOverlapAnalysis.
# Apply different analysis algorithms to the matrix of public clonotypes:
# "mds" - Multi-dimensional Scaling
repOverlapAnalysis(imm_ov1, "mds")
## Standard deviations:
## [1] 0 0 0 0
##
## Rotation:
## [,1] [,2]
## A2-i129 -110.709050 64.5877923
## A2-i131 -20.688100 29.5930468
## A2-i133 -148.856927 -135.3873972
## A2-i132 271.865359 -12.1997230
## A4-i191 -4.646556 -46.2700405
## A4-i192 -57.459268 156.0677169
## MS1 14.902501 -8.1401151
## MS2 17.526821 0.4664452
## MS3 3.978868 -9.9128964
## MS4 15.476432 -4.0909449
## MS5 17.478676 -4.3784903
## MS6 1.131243 -30.3353937
# "tsne" - t-Stochastic Neighbor Embedding
repOverlapAnalysis(imm_ov1, "tsne")
## DimI DimII
## A2-i129 -23.3467632 16.7965447
## A2-i131 -4.4367450 -64.5645986
## A2-i133 65.9439889 -11.9350892
## A2-i132 57.1586279 41.8758166
## A4-i191 20.7809897 4.4325925
## A4-i192 26.2717380 -35.1814953
## MS1 -68.9709820 3.4369293
## MS2 -10.6871143 -12.7217957
## MS3 -38.3194937 -33.1172355
## MS4 -0.4540292 -0.7545221
## MS5 -32.8084991 56.8425185
## MS6 8.8682819 34.8903347
## attr(,"class")
## [1] "matrix" "immunr_tsne"
# Visualise the results
vis(repOverlapAnalysis(imm_ov1, "mds"))
# Apply different analysis algorithms to the matrix of public clonotypes:
# "mds" - Multi-dimensional Scaling
repOverlapAnalysis(imm_ov1, "mds")
## Standard deviations:
## [1] 0 0 0 0
##
## Rotation:
## [,1] [,2]
## A2-i129 -110.709050 64.5877923
## A2-i131 -20.688100 29.5930468
## A2-i133 -148.856927 -135.3873972
## A2-i132 271.865359 -12.1997230
## A4-i191 -4.646556 -46.2700405
## A4-i192 -57.459268 156.0677169
## MS1 14.902501 -8.1401151
## MS2 17.526821 0.4664452
## MS3 3.978868 -9.9128964
## MS4 15.476432 -4.0909449
## MS5 17.478676 -4.3784903
## MS6 1.131243 -30.3353937
# "tsne" - t-Stochastic Neighbor Embedding
repOverlapAnalysis(imm_ov1, "tsne")
## DimI DimII
## A2-i129 -48.081766 24.0410239
## A2-i131 -42.157527 -16.2568961
## A2-i133 -10.058294 45.8984545
## A2-i132 57.148657 -2.3216393
## A4-i191 11.568840 14.6114506
## A4-i192 -9.232778 -19.8147357
## MS1 29.313483 51.6080489
## MS2 -12.395757 9.2580890
## MS3 23.670052 -44.0998498
## MS4 -3.126273 -0.7350184
## MS5 20.759052 -10.7725084
## MS6 -17.407690 -51.4164191
## attr(,"class")
## [1] "matrix" "immunr_tsne"
# Visualise the results
vis(repOverlapAnalysis(imm_ov1, "mds"))
# Clusterise the MDS resulting components using K-means
vis(repOverlapAnalysis(imm_ov1, "mds+kmeans"))
In order to build a massive table with all clonotypes from the list of repertoires use the pubRep function.
# Pass "nt" as the second parameter to build the public repertoire table using CDR3 nucleotide sequences
pr.nt = pubRep(immdata$data, "nt", .verbose = F)
pr.nt
## Source: local data table [516,021 x 14]
##
## # A tibble: 516,021 x 14
## CDR3.nt Samples `A2-i129` `A2-i131` `A2-i133` `A2-i132` `A4-i191`
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 TGCGCC… 12 15 6 6 7 10
## 2 TGCGCC… 12 12 4 11 8 3
## 3 TGTGCC… 12 3 4 3 14 5
## 4 TGCGCC… 11 7 1 5 3 2
## 5 TGTGCC… 11 13 9 14 22 12
## 6 TGTGCC… 11 1 2 NA 1 4
## 7 TGTGCC… 11 4 7 1 5 3
## 8 TGTGCC… 11 5 2 8 8 2
## 9 TGTGCC… 10 3 3 2 7 3
## 10 TGTGCC… 10 5 1 5 5 3
## # … with 516,011 more rows, and 7 more variables: `A4-i192` <dbl>,
## # MS1 <dbl>, MS2 <dbl>, MS3 <dbl>, MS4 <dbl>, MS5 <dbl>, MS6 <dbl>
# Pass "aa+v" as the second parameter to build the public repertoire table using CDR3 aminoacid sequences and V alleles
# In order to use only CDR3 aminoacid sequences, just pass "aa"
pr.aav = pubRep(immdata$data, "aa+v", .verbose = F)
pr.aav
## Source: local data table [505,701 x 15]
##
## # A tibble: 505,701 x 15
## CDR3.aa V.name Samples `A2-i129` `A2-i131` `A2-i133` `A2-i132` `A4-i191`
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 CASSFQ… TRBV5… 12 21 6 7 7 11
## 2 CASSLE… TRBV5… 12 19 5 14 10 3
## 3 CASSLE… TRBV5… 11 1 2 6 1 1
## 4 CASSLG… TRBV1… 11 24 7 2 29 15
## 5 CASSLG… TRBV5… 11 14 2 6 9 3
## 6 CASSLG… TRBV5… 11 1 1 1 3 2
## 7 CASSLQ… TRBV5… 11 16 3 2 6 1
## 8 CASSSQ… TRBV5… 11 11 1 6 3 5
## 9 CAS~FF TRBV1… 11 14 19 10 29 2
## 10 CASSFG… TRBV1… 10 4 2 7 22 2
## # … with 505,691 more rows, and 7 more variables: `A4-i192` <dbl>,
## # MS1 <dbl>, MS2 <dbl>, MS3 <dbl>, MS4 <dbl>, MS5 <dbl>, MS6 <dbl>
# You can also pass the ".coding" parameter to filter out all noncoding sequences first:
pr.aav.cod = pubRep(immdata$data, "aa+v", .coding=T)
# Create a public repertoire with coding-only sequences using both CDR3 amino acid sequences and V genes
pr = pubRep(immdata$data, "aa+v", .coding = T, .verbose = F)
# Apply the filter subroutine to leave clonotypes presented only in healthy individuals
pr1 = pubRepFilter(pr, immdata$meta, c(Status = "C"))
# Apply the filter subroutine to leave clonotypes presented only in diseased individuals
pr2 = pubRepFilter(pr, immdata$meta, c(Status = "MS"))
# Divide one by another
pr3 = pubRepApply(pr1, pr2)
# Plot it
p = ggplot() + geom_jitter(aes(x = "Treatment", y = Result), data=pr3)
p
immunarch comes with a gene segments data table containing known gene segments for several species. In order to get the current statistics of genes, call the gene_stats() function:
gene_stats()
## alias species ighd ighj ighv igij igkj igkv iglj iglv
## 1 bt BosTaurus 0 0 0 0 0 0 0 0
## 2 cd CamelusDromedarius 0 0 0 0 0 0 0 0
## 3 clf CanisLupusFamiliaris 0 0 0 0 0 0 0 0
## 4 dr DanioRerio 7 7 0 3 0 0 0 0
## 5 hs HomoSapiens 30 13 224 0 5 64 7 69
## 6 macmul MacacaMulatta 24 7 19 0 4 83 5 0
## 7 mmc MusMusculusCastaneus 0 0 0 0 0 4 0 0
## 8 mmd MusMusculusDomesticus 0 0 0 0 0 2 0 0
## 9 musmus MusMusculus 27 8 234 0 8 109 3 5
## 10 oa OrnithorhynchusAnatinus 3 10 0 0 0 0 0 0
## 11 oc OryctolagusCuniculus 10 11 39 0 8 26 2 20
## 12 om OncorhynchusMykiss 9 7 6 0 0 0 0 0
## 13 rn RattusNorvegicus 30 4 113 0 6 132 2 8
## 14 smth MusMusculusMolossinus 0 0 0 0 0 1 0 0
## 15 smth MusMusculusMusculus 0 0 0 0 0 1 0 0
## 16 smth MusSpretus 0 0 0 0 0 2 0 2
## 17 ss SusScrofa 5 5 15 0 8 19 4 14
## traj trav trbd trbj trbv trdd trdj trdv trgj trgv vprv
## 1 46 0 0 0 0 5 3 2 6 14 0
## 2 0 0 0 0 0 0 0 7 2 2 0
## 3 0 0 2 8 19 0 0 0 7 8 0
## 4 0 0 0 0 0 0 0 0 0 0 0
## 5 52 47 3 14 60 3 4 6 3 9 1
## 6 0 0 2 15 58 0 0 0 0 0 0
## 7 0 0 0 0 0 0 0 0 0 0 0
## 8 0 0 0 0 0 0 0 0 0 0 0
## 9 41 145 2 14 23 2 3 7 0 11 0
## 10 0 0 0 0 0 0 0 0 0 0 0
## 11 0 0 0 0 0 0 0 0 0 0 0
## 12 0 0 1 9 0 0 0 0 0 0 0
## 13 0 0 0 0 0 0 0 0 0 0 0
## 14 0 0 0 0 0 0 0 0 0 0 0
## 15 0 0 0 0 0 0 0 0 0 0 0
## 16 0 0 0 0 0 0 0 0 0 0 0
## 17 0 0 0 0 0 0 0 0 0 0 0
To compute the distributions of genes, immunarch includes the geneUsage function. It receives a repertoire or a list of repertoires as input and genes and species for which you want to get the statistics. E.g., if you plan to use TRBV genes of Homo Sapiens, you need to use the hs.trbv string in the function, where hs comes from the alias column and trbv is the gene name. Of if you plan to use IGHV genes of Mus Musculus, you need to use musmus.ighv:
# Next four function calls are equal. "hs" is from the "alias" column.
imm_gu = geneUsage(immdata$data, "hs.trbv")
# imm_gu = geneUsage(immdata$data, "HomoSapiens.trbv")
# imm_gu = geneUsage(immdata$data, "hs.TRBV")
# imm_gu = geneUsage(immdata$data, "HomoSapiens.TRBV")
imm_gu
## # A tibble: 48 x 13
## Names `A2-i129` `A2-i131` `A2-i133` `A2-i132` `A4-i191` `A4-i192` MS1
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 TRBV… 296 357 69 137 163 90 171
## 2 TRBV… 1491 2032 864 783 2401 1353 1329
## 3 TRBV… 134 91 144 142 140 95 113
## 4 TRBV… 1078 1210 874 1077 674 660 542
## 5 TRBV… 64 104 56 145 31 28 54
## 6 TRBV… 3980 3190 2222 2749 2228 3513 2772
## 7 TRBV… 285 404 76 314 99 109 84
## 8 TRBV… 293 334 238 391 187 200 193
## 9 TRBV… 444 626 413 384 218 210 160
## 10 TRBV… 699 1083 746 541 468 378 790
## # … with 38 more rows, and 5 more variables: MS2 <dbl>, MS3 <dbl>,
## # MS4 <dbl>, MS5 <dbl>, MS6 <dbl>
Gene distributions could be computed either using counts of individual clonotypes (.quant = "count") or not using them (.quant = NA).
In order to compute allele-level or family-level distributions, change the .type parameter.
Due to the ambiguity of gene alignments for some clonotypes, geneUsage has the following options to deal with ambiguous data:
.ambig = "exc" - filters out all clonotypes with ambiguous gene alignments.
.ambig = "inc" - includes all possible combinations of ambiguous gene alignments from the data. NOTE: ImmunoSEQ formats use non-standart gene segment names, so it is preferable to use this argument value with ImmunoSEQ formats. This argument is ON by default to ease the gene manipulation. Feel free to change it to "exc" in case of other data formats.
.ambig = "wei" - introduces weighted approach (divides by n (1/n) the frequency for each entry of the corresponding gene if there are n genes for a clonotype).
.ambig = "maj" - chooses only the first gene segment.
Parameter .norm controls whether immunarch will normalise the data to ensure the sum of all frequencies to be equal 1 or not.
You can visualise the histogram of gene usage in two different ways:
imm_gu = geneUsage(immdata$data, "hs.trbv", .norm = T, .ambig = "exc")
vis(imm_gu, .plot = "hist", .grid = T)
vis(imm_gu, .plot = "hist", .grid = F, .by = "Status", .meta = immdata$meta)
Another practical approach to the visualisation of group distributions are box plots:
vis(imm_gu, .by = "Status", .meta = immdata$meta, .plot = "box")
Sometimes tree maps could be used to reveal the differences in repertoires. They display the overall picture and the comparisons of related items, both at the same time, enabling intelligible exploration of the details:
imm_gu = geneUsage(immdata$data, "hs.trbv", .norm = T, .ambig = "exc")
vis(imm_gu, .plot = "tree")
To analyse the gene usage immunarch introduces the geneUsageAnalysis function. The .method parameter controls how the data is going to be preprocessed and analysed. geneUsageAnalysis includes following methods for preprocessing:
“js” - Jensen-Shannon Divergence.
“cor” - correlation.
“cosine” - cosine similarity.
“pca” - principal component analysis.
“mds” - multi-dimensional scaling.
“tsne” - t-Distributed Stochastic Neighbor Embedding.
And a few methods for the actual analysis:
“hclust” - clusters the data using hierarchical clustering.
“kmeans” - clusters the data using K-means.
“dbscan” - clusters the data using DBSCAN.
“permut” - permutation testing of differences between groups.
“anova” - computes ANOVA for each gene separately on data splitted to groups (without preprocessing). Results could be used with Tukey post-hoc test in order to detect significant differences among groups per gene.
“kruskall” - compute Kruskall for each gene separately on data splitted to groups (without preprocessing). Results could be used with Dunn test in order to detect significant differences between groups.
You can call several methods in a single line of code, which is probably the most powerful feature of the package. For instance, "js+hclust" first computes Jensen-Shannon divergence and then applies hierarchical clustering on the resulting distance matrix, whereas "anova" computes ANOVA on each gene separately after repertoires have been grouped:
imm_gu = geneUsage(immdata$data, "hs.trbv", .norm = T, .ambig = "exc")
imm_gu_js = geneUsageAnalysis(imm_gu, .method = "js", .verbose = F)
imm_gu_cor = geneUsageAnalysis(imm_gu, .method = "cor", .verbose = F)
gridExtra::grid.arrange(vis(imm_gu_js, .title = "Gene usage JS-divergence", .leg.title = "JS", .text.size=1.5), vis(imm_gu_cor, .title = "Gene usage correlation", .leg.title = "Cor", .text.size=1.5), ncol = 2)
Now let us visualise the output after both preprocessing and analysis:
imm_gu_js[is.na(imm_gu_js)] = 0
vis(geneUsageAnalysis(imm_gu, "cosine+hclust", .verbose = F))
#vis(geneUsageAnalysis(imm_gu, "js+dbscan", .verbose = F))
On top of that you can add clustering:
imm_cl_pca = geneUsageAnalysis(imm_gu, "js+pca+kmeans", .verbose = F)
imm_cl_mds = geneUsageAnalysis(imm_gu, "js+mds+kmeans", .verbose = F)
imm_cl_tsne = geneUsageAnalysis(imm_gu, "js+tsne+kmeans", .perp = .01, .verbose = F)
## Perplexity should be lower than K!
grid.arrange(vis(imm_cl_pca, .plot = "clust"), vis(imm_cl_mds, .plot = "clust"), vis(imm_cl_tsne, .plot = "clust"), ncol = 3)
Sevral approaches to the estimation of repertoire diversity are implemented in the repDiversity function. The .method parameter similarly to abovementionned functions sets the means for diversity estimation. You can choose one of the following methods:
chao1 - Chao1 estimator is a nonparameteric asymptotic estimator of species richness (number of species in a population).
hill - Hill numbers are a mathematically unified family of diversity indices (differing only by an exponent q).
div - True diversity, or the effective number of types, refers to the number of equally-abundant types needed for the average proportional abundance of the types to equal that observed in the dataset of interest where all types may not be equally abundant.
gini.simp - The Gini-Simpson index is the probability of interspecific encounter, i.e., probability that two entities represent different types.
inv.simp - Inverse Simpson index is the effective number of types that is obtained when the weighted arithmetic mean is used to quantify average proportional abundance of types in the dataset of interest.
gini - The Gini coefficient measures the inequality among values of a frequency distribution (for example levels of income). A Gini coefficient of zero expresses perfect equality, where all values are the same (for example, where everyone has the same income). A Gini coefficient of one (or 100 percents ) expresses maximal inequality among values (for example where only one person has all the income).
raref - Rarefaction is a technique to assess species richness from the results of sampling through extrapolation.
# Compute statistics and visualise them
# Chao1 diversity measure
div_chao = repDiversity(immdata$data, "chao1")
# Hill numbers
div_hill = repDiversity(immdata$data, "hill")
# D50
div_d50 = repDiversity(immdata$data, "d50")
# Ecological diversity measure
div_div = repDiversity(immdata$data, "div")
p1 = vis(div_chao)
p2 = vis(div_chao, .by=c("Status", "Sex"), .meta=immdata$meta)
p3 = vis(div_hill, .by=c("Status", "Sex"), .meta=immdata$meta)
p4 = vis(div_d50)
p5 = vis(div_d50, .by="Status", .meta=immdata$meta)
p6 = vis(div_div)
gridExtra::grid.arrange(p1, p2, ncol = 2)
gridExtra::grid.arrange(p3, p6, ncol = 2)
gridExtra::grid.arrange(p4, p4, ncol = 2)
imm_raref = repDiversity(immdata$data, "raref", .verbose = F)
grid.arrange(vis(imm_raref), vis(imm_raref, .by="Status", .meta=immdata$meta), ncol=2)
Tracking the dynamics of clonotypes across time points in different repertoires is a very useful tool in medicine. It is possible to track clonotypes based on their cdr3 sequences with trackClonotypes function, which will be applied to those defined by .which parameter. The .which parameter takes one of the three kinds of input:
.which input can be a vector of length two c(X, Y), where Y is the number of the most abundant clonotypes to take from X’th data frame in the input list.
.which input can be a character vector of sequences to take from all data frames.
.which input can be a data frame with two or three columns - first for the sequences, and second/third for the gene segments.
In order to estimate possible noise in the data immunarch uses two simple model based on either poisson distribution (.model = "poisson") or on normal distribution (.method = "norm"). Pass the none string to use no model.
# imm_tr = 1
# grid.arrange(vis(imm_tr, .plot = "line"), vis(imm_tr, .plot = "area"), ncol = 2)
Spectratype is a useful way to represent distributions of genes per length. Parameter .quant controls the quantity that used to compute proportions of genes - either by clonotype (id) or by number of clones per clonotype (count)
p1 = vis(spectratype(immdata$data[[1]], .quant = "id", .col = "aa", .gene = "v"))
p2 = vis(spectratype(immdata$data[[1]], .quant = "count", .col = "aa", .gene = "v"))
grid.arrange(p1, p2, ncol = 2)
The plotly package allows plotting of interactive graphics directly from the ggplot2 objects, which could be more useful than static plots in some cases.
require(plotly)
imm_top = repClonality(immdata$data, .method = "top")
plotly::ggplotly(vis(imm_top))
## Using Sample as id variables
imm_gu = geneUsage(immdata$data[[1]], "hs.trbv", .norm = T, .ambig = "exc")
plotly::ggplotly(vis(imm_gu, .plot = "hist", .ncol = 2, .grid = T))
## Using Names as id variables
imm_gu = geneUsage(immdata$data[c(5,6,7,8,9)], "hs.trbv", .norm = T, .ambig = "exc")
plotly::ggplotly(vis(imm_gu, .plot = "hist", .grid = F))
## Using Names as id variables
# plotly::ggplotly(vis(imm_gu, .by = "Status", .meta = immdata$meta, .plot = "box", .violin = F))
p1 = vis(spectratype(immdata$data[[1]], .quant = "id", .col = "aa", .gene = "v"))
plotly::ggplotly(p1)
The package is freely distributed under the GPL v3 license. You can read more about it here.
Additionally, we provide an annual subscription that includes next services:
Contact us at support@immunomind.io for more information.